
use utf-8 codec if ensure_ascii is False to avoid UnicodeError #5

Open · wants to merge 2 commits into master
Conversation

neuralhax

When using the latest Cortex version (currently 3.0.0-RC3), which uses cortexutils 2.0.0, I noticed that the MaxMind_GeoIP_3_0 analyzer always fails with an "Invalid IP address" error, e.g.:

{
  "errorMessage": "Invalid IP address",
  "input": "{\"pap\":2,\"tlp\":2,\"parameters\":{},\"dataType\":\"ip\",\"data\":\"1.1.1.1\",\"message\":\"\",\"config\":{\"check_pap\":true,\"check_tlp\":true,\"proxy_https\":null,\"jobCache\":10,\"max_tlp\":2,\"auto_extract_artifacts\":false,\"cacerts\":null,\"jobTimeout\":30,\"proxy_http\":null,\"max_pap\":2}}",
  "success": false
}

The root cause of that error is the following exception:

Traceback (most recent call last):
  File "/opt/Cortex-Analyzers/analyzers/MaxMind/geo.py", line 88, in run
    'traits': self.dump_traits(city.traits)
  File "/usr/local/lib/python2.7/dist-packages/cortexutils/analyzer.py", line 106, in report
    }, ensure_ascii)
  File "/usr/local/lib/python2.7/dist-packages/cortexutils/worker.py", line 178, in report
    self.__write_output(output, ensure_ascii=ensure_ascii)
  File "/usr/local/lib/python2.7/dist-packages/cortexutils/worker.py", line 123, in __write_output
    json.dump(data, f_output, ensure_ascii=ensure_ascii)
  File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-7: ordinal not in range(128)

The latest Cortex 2.x version, which uses cortexutils 1.3.0, doesn't have this problem. The difference is that in cortexutils 2.0.0 the report is written into a file instead of standard output. Standard output is set to use the utf-8 encoding in the __set_encoding() function, but the same is not done when writing into the file. Python 2 uses the ascii codec by default, so when json.dump() is called with ensure_ascii=False (as in this case) and the data contains non-ASCII characters, it raises a UnicodeError. This behaviour is also described in the Python json.dump() documentation (a minimal reproducer is sketched after the quoted docstring below):

If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only.  If ``ensure_ascii`` is
false, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
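
To illustrate, here is a minimal Python 2 reproducer of that failure. The data and file path are made-up examples, not cortexutils code:

# -*- coding: utf-8 -*-
# Python 2: json.dump() with ensure_ascii=False emits unicode chunks;
# writing them to a plain byte-mode file falls back to the ASCII codec
# and raises UnicodeEncodeError, as in the traceback above.
import json

data = {u'city': u'Zürich'}            # non-ASCII unicode code points

with open('/tmp/output.json', 'w') as f_output:
    json.dump(data, f_output, ensure_ascii=False)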

Unless I'm missing something, I believe this can be fixed by using the utf-8 encoding for the output file, in the same way as is done in the __set_encoding() function for sys.stdout and sys.stderr. With this patch, I no longer observe the reported error.
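
For reference, this is roughly what that looks like on Python 2, wrapping the output file with a UTF-8 writer the same way __set_encoding() wraps sys.stdout/sys.stderr (illustrative sketch only, not the literal patch):

# -*- coding: utf-8 -*-
import codecs
import json

data = {u'city': u'Zürich'}

with open('/tmp/output.json', 'w') as raw:
    # The StreamWriter encodes unicode chunks as UTF-8 before writing,
    # so json.dump(..., ensure_ascii=False) no longer hits the ASCII codec.
    f_output = codecs.getwriter('utf-8')(raw)
    json.dump(data, f_output, ensure_ascii=False)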

@DarkZatarra

I can confirm the bug; I had the same problem and it's fixed now.

@neuralhax (Author)

Unless I'm missing something, ...

Well, I have been missing something indeed. The original patch fixed the problem with the MaxMind analyzer, but introduced problems with other analyzers, e.g. AbuseIPDB, where I started to observe the following error:
"No content to map due to end-of-input\n at [Source: (sun.nio.ch.ChannelInputStream); line: 1, column: 0]"

The difference between these two analyzers is that MaxMind uses Python 2 and AbuseIPDB uses Python 3. In Python 2, text can be stored either as the str type or as the unicode type. In Python 3, str replaced unicode and bytes was introduced to replace Python 2's str. Non-ASCII characters in a Python 2 str are by default already encoded in UTF-8, whereas in unicode they are code points that are not encoded by default. Another difference is that in Python 2 the default encoding for files is ASCII, while in Python 3 it is UTF-8.

When json.dump() is called and a string contains non-ASCII Unicode code points, it fails with a UnicodeEncodeError because it tries to encode those characters with the ASCII codec; when the file writer uses the UTF-8 codec instead, it succeeds. However, if a string already contains UTF-8 encoded bytes, the UTF-8 writer fails with a UnicodeDecodeError, because it assumes ASCII input and cannot decode those bytes. Patch f109746 should cover both cases (unless I'm missing something again). It will still fail if the written data contains a mix of UTF-8 encoded bytes and Unicode code points; I hope that stays a theoretical problem only...
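
For the record, here is a rough Python 2/3 sketch of the idea described above. It is not the actual content of f109746, and write_report()/_to_unicode() are made-up names (the real cortexutils method is __write_output()):

# -*- coding: utf-8 -*-
# Sketch only: normalise Python 2 byte strings to unicode before dumping,
# then write the report through a UTF-8 writer on Python 2; on Python 3 an
# explicit UTF-8 text file is enough.
import codecs
import json
import sys

def _to_unicode(obj):
    # Byte strings in the report are assumed to already be UTF-8 encoded,
    # so decode them up front; json.dump() then only ever sees unicode.
    if isinstance(obj, str):
        return obj.decode('utf-8')
    if isinstance(obj, dict):
        return dict((_to_unicode(k), _to_unicode(v)) for k, v in obj.items())
    if isinstance(obj, list):
        return [_to_unicode(v) for v in obj]
    return obj

def write_report(data, path, ensure_ascii=False):
    if sys.version_info[0] == 2:
        with open(path, 'w') as raw:
            # Encode unicode chunks as UTF-8 instead of the default ASCII codec.
            f_output = codecs.getwriter('utf-8')(raw)
            json.dump(_to_unicode(data), f_output, ensure_ascii=ensure_ascii)
    else:
        # Python 3 str is already text; just be explicit about the encoding.
        with open(path, 'w', encoding='utf-8') as f_output:
            json.dump(data, f_output, ensure_ascii=ensure_ascii)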

@DarkZatarra

@xlaruen you are right. I was able to reproduce the AbuseIPDB part as well; I've applied your commit and it works, for now.
